2025-12-03
Given \(n\) data points \(\{x_i, y_i\}_{i=1}^n \xrightarrow{\text{Predict}} (x_{\texttt{new}}, y_{\texttt{new}})\)
Can all be done with Gaussian distributions!
If we have a vector of random variables \(\mathbf{x}\) and
\[ \mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \mathbf{\Sigma}) \]
then the joint probability density of \(\mathbf{x}\) is given by the multivariate normal:
\[ p\left( \mathbf{x} \,|\, \boldsymbol{\mu}, \mathbf{\Sigma} \right) \propto \exp \left\{ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\} \]
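Written out in full, the proportionality constant is the standard MVN normalizer:
\[ p\left( \mathbf{x} \,|\, \boldsymbol{\mu}, \mathbf{\Sigma} \right) = (2\pi)^{-d/2}\,|\mathbf{\Sigma}|^{-1/2} \exp \left\{ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \mathbf{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\} \]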
A two-dimensional MVN with \(\boldsymbol{\mu}=[0,0]\) and \(\Sigma=\begin{bmatrix} 1 & 0.7\\ 0.7 & 1 \end{bmatrix}\):
# load mvtnorm for the multivariate normal density
library(mvtnorm)
# mean vector
mu <- c(0, 0)
# covariance matrix
Sigma <- matrix(c(1, 0.7,
                  0.7, 1), nrow = 2)
# grid of x1 and x2 values
x1x2_grid <- expand.grid(x1 = seq(-3, 3, length.out = 100),
                         x2 = seq(-3, 3, length.out = 100))
# density at each grid point (used for the contour plot)
probabilities <- dmvnorm(x1x2_grid, mean = mu, sigma = Sigma)
Probability contour plot:
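One way to render the contour plot from the grid above (a sketch assuming the `ggplot2` package is available; variable names reuse the chunk above):

```r
library(ggplot2)

# attach the density values to the grid and draw probability contours
x1x2_grid$density <- probabilities
ggplot(x1x2_grid, aes(x = x1, y = x2, z = density)) +
  geom_contour() +
  coord_equal()
```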
We can condition on one of the variables: \(p(x_2 \,|\, x_1,\,\mathbf{\Sigma})\)
\[ \mathbf{\Sigma} = \begin{bmatrix} 1 & 0.7\\ 0.7 & 1 \end{bmatrix} \]
Extending to ten variables, with correlation decaying smoothly as the indices grow apart:
\[ \mathbf{\Sigma} = \begin{bmatrix} 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 & 0.03 & 0.01 & 0.00 & 0.00 \\ 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 & 0.03 & 0.01 & 0.00 \\ 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 & 0.03 & 0.01 \\ 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 & 0.03 \\ 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 & 0.08 \\ 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 & 0.20 \\ 0.03 & 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 & 0.41 \\ 0.01 & 0.03 & 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 & 0.67 \\ 0.00 & 0.01 & 0.03 & 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 & 0.90 \\ 0.00 & 0.00 & 0.01 & 0.03 & 0.08 & 0.20 & 0.41 & 0.67 & 0.90 & 1.00 \end{bmatrix} \]
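A covariance matrix with this banded structure can be generated by a squared-exponential kernel over the indices. The values above are reproduced (to two decimals) by \(k(x, x') = \exp\{-(x-x')^2/10\}\), i.e. \(\ell^2 = 5\) — an inferred choice, not stated in the source:

```r
# squared-exponential kernel over indices; ell2 = 5 is inferred from the matrix
se_kernel <- function(x, xp, ell2 = 5) exp(-(x - xp)^2 / (2 * ell2))

x <- 1:10
Sigma10 <- outer(x, x, se_kernel)
round(Sigma10, 2)  # matches the 10 x 10 matrix above
```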
In a simple example, imagine we are given three data points \(\{(x_i, y_i)\}_{i=1}^3\) and want to predict the function at new inputs.
A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
As a distribution over functions, a Gaussian process is completely specified by two functions:
\[ f(\mathbf{x}) \sim \mathcal{GP}\big(\, m(\mathbf{x}), k(\mathbf{x}, \mathbf{x^\star}) \, \big) \]
Generative model
\[ y(\mathbf{x}) = f(\mathbf{x}) \Big[ + \epsilon\sigma_y \Big]\\ p(\epsilon)=\mathcal{N}(0,1) \]
Place GP prior over the nonlinear function (mean function often taken as 0).
\[ p(f(\mathbf{x}) \, | \, \theta) = \mathcal{GP}\big(0, k(\mathbf{x}, \mathbf{x^\star})\big)\\ k(\mathbf{x}, \mathbf{x^\star}) = \sigma^2 \exp \left\{ -\frac{1}{2\ell^2}(x-x^\star)^2 \right\} \]
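Draws from this prior can be visualized by evaluating the kernel on a grid (a sketch using `mvtnorm`; \(\sigma^2 = 1\) and \(\ell = 1\) are arbitrary choices):

```r
library(mvtnorm)

# squared-exponential covariance over a grid of inputs
se_cov <- function(x, sigma2 = 1, ell = 1) {
  sigma2 * exp(-outer(x, x, "-")^2 / (2 * ell^2))
}

x <- seq(-3, 3, length.out = 50)
K <- se_cov(x) + diag(1e-8, length(x))  # jitter for numerical stability

# three sample paths from the zero-mean GP prior
f_samples <- rmvnorm(3, sigma = K)
matplot(x, t(f_samples), type = "l", lty = 1, ylab = "f(x)")
```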
\[ p(\mathbf{y}_1, \mathbf{y}_2) = \mathcal{N} \left( \begin{bmatrix} \boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \mathbf{\Sigma}_{11} & \mathbf{\Sigma}_{12}\\ \mathbf{\Sigma}_{21} & \mathbf{\Sigma}_{22} \end{bmatrix}\right) \] By the definition of conditional probability:
\[ p(\mathbf{y}_1 \,|\, \mathbf{y}_2) = \frac{p(\mathbf{y}_1, \mathbf{y}_2)}{p(\mathbf{y}_2)} \]
With an involved proof:
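the conclusion is the standard Gaussian conditioning identity (see Williams and Rasmussen 2006):
\[ p(\mathbf{y}_1 \,|\, \mathbf{y}_2) = \mathcal{N}\big( \boldsymbol{\mu}_1 + \mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}(\mathbf{y}_2 - \boldsymbol{\mu}_2),\; \mathbf{\Sigma}_{11} - \mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}\mathbf{\Sigma}_{21} \big) \]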
In Bayesian optimization, an acquisition function built from the surrogate model (here, the Gaussian process) is used to select the next best evaluation location
Steps:
Explore points with high variance.
Exploit points most likely to be minimum.
\[ I(x_{\texttt{next}}) = \max\left\{0, f(x_M^\star) - \widetilde{f}(x_{\texttt{next}}) \right\} \]
Each point \(x_{\texttt{next}}\) in a Gaussian process is associated with a mean given by evaluating the mean function at \(x_{\texttt{next}}\), \(\mu(x_{\texttt{next}})\), and a variance given by evaluating the covariance function at \(x_{\texttt{next}}\), \(k(x_{\texttt{next}}, x_{\texttt{next}})\).
\[ \lambda(y) := \begin{cases} y;\quad y < \eta \\ \eta;\quad y \ge \eta \end{cases} \]
Fancy way of saying loss is \(\min(y, \eta)\).
\[ \mathbb{E}[\lambda(y)] := \int_{y=-\infty}^{y=\infty} \lambda(y)\cdot p(y \,| \,x, I_0) \, dy \]
where \(x\) is the new selected point and \(I_0\) is the current “knowledge” (e.g., Gaussian process parameters).
Our integral must be split at the critical value \(\eta\). When \(y<\eta\), \(\lambda(y)=y\) and when \(y\ge\eta\), \(\lambda(y)=\eta\).
\[ \mathbb{E}[\lambda(y)] := \int_{y=-\infty}^{y=\eta} y\cdot p(y \,| \,x, I_0) \, dy + \int_{y=\eta}^{y=\infty} \eta\cdot p(y \,| \,x, I_0) \, dy \]
Note that \(p(y \,| \,x, I_0)\) is driven by the Gaussian process!
\[ p(y \,| \,x, I_0) = \mathcal{N}\big(y; m(y\,|\,I_0), k(y_0,y\,|\,I_0)\big) \]
Plugging this into the integrals, we get
\[ \begin{multline} \mathbb{E}[\lambda(y)] = \int_{-\infty}^{\eta} y\cdot \mathcal{N}\big(y; m(y\,|\,I_0), k(y_0,y\,|\,I_0)\big) \, dy \\ + \int_{\eta}^{\infty} \eta\cdot \mathcal{N}\big(y; m(y\,|\,I_0), k(y_0,y\,|\,I_0)\big) \, dy \end{multline} \]
\[ \begin{multline} =\eta + (m(y\,|\,I_0) - \eta) \Phi(\eta\,|\,m(y\,|\,I_0),k(y_0,y\,|\,I_0)) \\ - k(y_0,y\,|\,I_0)\phi(\eta\,|\,m(y\,|\,I_0),k(y_0,y\,|\,I_0) ) \end{multline} \]
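The closed form translates directly into R (a sketch; the function and argument names are mine, with `m` the posterior mean, `v` the posterior variance, and `eta` the current best observed value):

```r
# closed-form expected loss E[lambda(y)] for a Gaussian predictive distribution
expected_loss <- function(m, v, eta) {
  s <- sqrt(v)
  eta + (m - eta) * pnorm(eta, mean = m, sd = s) -
    v * dnorm(eta, mean = m, sd = s)
}
```

As a sanity check, when \(v \to 0\) this tends to \(\min(m, \eta)\), recovering the loss \(\lambda\) itself.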
GPGO was used to find the optima for several known, difficult functions.
Performance was measured using “gap”:
\[ G := \frac{y(x^{\texttt{first}}) - y(x^{\texttt{best}})}{y(x^{\texttt{first}}) - y^{\texttt{opt}}} \]
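Reading the formula: \(G = 0\) means no improvement over the first evaluation, and \(G = 1\) means the optimum was found. A one-line sketch (the helper name is mine):

```r
# gap metric: fraction of the possible improvement actually achieved
gap <- function(y_first, y_best, y_opt) (y_first - y_best) / (y_first - y_opt)

gap(y_first = 5, y_best = 5, y_opt = 1)  # 0: no improvement
gap(y_first = 5, y_best = 1, y_opt = 1)  # 1: optimum found
```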
Journal article: Osborne et al. (2009)
Mathematics/derivations: Williams and Rasmussen (2006), Osborne et al. (2009)
Inspiration for some of the visualizations: Turner (2016), Gramacy (2022a), and Gramacy (2022b)
Bayesian optimization visualization: Farina (2024)
Farina, Gabriele. 2024. “Lecture 16 - Bayesian Optimization.” Massachusetts Institute of Technology (MIT), April 25. https://www.mit.edu/~gfarina/2024/67220s24_L16_bayesian_optimization/L16.pdf.
Gramacy, Robert B. 2022a. “Surrogate Modeling and Bayesian Optimization.” Virginia Tech, September 21. https://www.youtube.com/watch?v=0Jj9ikDxqf4.
Gramacy, Robert B. 2022b. “Surrogate Modeling and Bayesian Optimization (Part 2).” Virginia Tech, September 21. https://www.youtube.com/watch?v=JKp32HVxIQs.
Osborne, Michael A., Roman Garnett, and Stephen J. Roberts. 2009. “Gaussian Processes for Global Optimization.” 3rd International Conference on Learning and Intelligent Optimization (LION3).
Turner, Richard. 2016. “Machine Learning Tutorial Series: Gaussian Processes.” Imperial College London, November 23. https://www.youtube.com/watch?v=92-98SYOdlY.
Williams, Christopher K. I., and Carl Edward Rasmussen. 2006. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.